Abstract: Semiconductor devices have been scaled down to a level
at which the constituent materials can no longer be considered
continuous. To account for atomistic randomness, surface effects,
and quantum mechanical effects, an atomistic modeling approach
needs to be pursued. The Nanoelectronic Modeling Tool (NEMO 3-D)
has satisfied this requirement by including empirical sp^3s* and
sp^3d^5s* tight binding models and considering strain, and has
successfully simulated various semiconductor material systems.
Computationally, however, NEMO 3-D needs significant improvements
to utilize the increasing supply of processors. This paper
introduces the new modeling tool, OMEN 3-D, and discusses its
major computational improvements: 3-D domain decomposition and
multi-level parallelism. As a featured application, a fully 3-D
parallelized Schrödinger-Poisson solver and its application to
calculating the bandstructure of a δ-doped phosphorus (P) layer in
silicon are demonstrated. Impurity bands due to the donor ion
potentials are computed.

I. Introduction

Need for Atomistic Modeling: As semiconductor structures
are scaled down to deca-nano sizes, the underlying material can
no longer be considered continuous. The number of atoms
in the active device region becomes countable, in the range
of 50,000 to around 1 million, and their local arrangement
becomes critical in interfaces, alloys, and strained systems. An
atomistic modeling approach needs to be used to capture such
discreteness and quantum mechanical effects. Most experimentally
relevant structures are not infinitely periodic, but are finite
in size and contain contacts; such geometries call for a local
orbital basis rather than a plane wave basis, which implies
infinite periodicity. Furthermore, we are primarily interested in
stable semiconductor structures with well-established bonds,
which lessens or even eliminates the need to compute the
formation of bonds with a full ab-initio methodology.

Multi-Million Atom Simulations: NEMO 3-D [1], [2] uses
empirical sp^3s* and sp^3d^5s* tight binding models that have
been carefully calibrated to bulk materials in the III-V [3]
and Si/Ge [4], [5] material systems under various bulk strain
and composition configurations. This bulk parameterization is
transferred to the nanoscale under the assumption of weak
charge redistributions. Weak piezo-electric effects in the
InGaAs system can be captured through strain-derived charge
and electrostatic potential corrections [6], [7]. Transferability
of the bulk parameters to nanometer devices was demonstrated
by experimentally verified multi-million atom calculations for
valley splitting in Si on SiGe [8], single impurities in Si
FinFETs, and InAs quantum dots in an InGaAs buffer matrix
[11]. In these simulations none of the bulk parameters were
modified and the nominal device dimensions were used to
obtain quantitative agreement with experiment. These
simulations also showed that it was essential to include millions of
atoms in the simulation domain and that simplified effective
mass models have led to wrong conclusions.

Computational Cost: Multi-million atom calculations in
NEMO 3-D come, however, at a typical computational price of
4-10 hours of runtime on 20-64 cores of a standard cluster for a
single evaluation of the eigenvalue spectrum. Inclusion of this
one-pass electronic structure calculation in a self-consistent
Poisson solution is possible, but increases the computation time
by another factor of 6-20, which pushes the computational
requirement into the realm of days rather than hours. In
light of the huge investments in peta-scale computing, with
over 100,000 cores available on a single supercomputer,
efficient parallelism is the key to computational speed-up.
We have been able to demonstrate NEMO 3-D scaling to 8,196
processors [12]; however, such a high level of scaling can only
be achieved for unrealistically long, essentially 1-D structures,
due to the 1-D spatial parallel decomposition of NEMO 3-D.
NEMO 3-D therefore needs significant improvements in its
parallelization schemes, data handling, post-processing, and
code maintainability.

OMEN 3-D: The development of the new nanoelectronic
modeling tool, OMEN 3-D, arose from the need for
expandability in an increasingly processor-rich environment.
OMEN 3-D is equipped with a more powerful parallelization
engine, a 3-D domain decomposition scheme, and general
multi-level parallelism. In addition, self-consistent charge
calculations, which need additional computational power, are
built in to support various kinds of scientific simulations, from
impurity physics to device applications.

This paper is organized as follows. In Sections II-A and II-B,
the parallelization schemes in OMEN 3-D and their benchmark
results are presented. In Section II-C, the multi-level
parallelism is briefly introduced. A Schrödinger-Poisson solver is
explained in Section III-A. Section III-B contains an example
of a self-consistent bandstructure simulation of a 2-D P δ-doped
layer in silicon at 4 K.

II. Parallelization Scheme in OMEN 3-D 

The major feature of OMEN 3-D is its enhanced parallelization
algorithm. NEMO 3-D uses a 1-D spatial decomposition
scheme for parallelism. NEMO 3-D has been tested on many
supercomputers and has been shown to scale almost perfectly;
however, the maximum number of usable processors is
strongly limited by the device geometry in the 1-D decomposition.

Fig. 1. (a) Schematic diagram of the domain decomposition scheme.
(b) Multi-level parallelism in OMEN 3-D.

A. 3-D Spatial Domain Decomposition 

To reduce compute times by using computers with more
than 10,000 cores, a new domain decomposition scheme
is introduced in OMEN 3-D. In OMEN 3-D, a device of any
shape can be spatially decomposed in three dimensions, and
each subdomain is assigned to a corresponding processor. The
maximum number of processors along each direction equals the
number of unit cells in that direction. Based on the spatial
information, each processor holds only the information about the
atoms in its subdomain and the neighbor atoms from adjacent
subdomains; no global position information is held locally, which
minimizes memory consumption and makes it possible to simulate
large devices (Fig. 1(a)).
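
The decomposition described above can be illustrated with a minimal sketch. It is not OMEN 3-D code: the function names, the row-major rank ordering, and the even-split rule are assumptions made for illustration only. It maps a processor rank to its position in a (Px, Py, Pz) processor grid and to the half-open range of unit cells it would own along each axis, and it rejects grids with more processors than unit cells along an axis.

```python
def rank_to_coords(rank, procs):
    """Map a linear rank to (cx, cy, cz) on a procs = (Px, Py, Pz) grid.

    Row-major ordering is an assumption of this sketch.
    """
    px, py, pz = procs
    if not 0 <= rank < px * py * pz:
        raise ValueError("rank outside processor grid")
    cz = rank % pz
    cy = (rank // pz) % py
    cx = rank // (py * pz)
    return cx, cy, cz


def split_cells(n_cells, n_parts, part):
    """Half-open range [start, stop) of unit cells owned by `part` on one axis.

    Cells are spread as evenly as possible; the first (n_cells % n_parts)
    parts receive one extra cell.
    """
    if n_parts > n_cells:
        raise ValueError("more processors than unit cells along this axis")
    base, extra = divmod(n_cells, n_parts)
    start = part * base + min(part, extra)
    stop = start + base + (1 if part < extra else 0)
    return start, stop


def subdomain(rank, cells, procs):
    """Unit-cell ranges (x, y, z) owned by `rank` for a cells = (Nx, Ny, Nz) device."""
    coords = rank_to_coords(rank, procs)
    return tuple(split_cells(n, p, c) for n, p, c in zip(cells, procs, coords))


if __name__ == "__main__":
    # 80 x 80 x 80 unit cells on a hypothetical 4 x 4 x 2 processor grid.
    print(subdomain(0, (80, 80, 80), (4, 4, 2)))
    print(subdomain(31, (80, 80, 80), (4, 4, 2)))
```

In a real code each processor would additionally keep a list of neighbor atoms from the adjacent subdomains; that bookkeeping is omitted here.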

The major drawback for 3D parallelization is the increase 
of the complexity of communication by 0(n^"""). The 
increased coupling among the processors may cause significant 
performance degradation; there is a trade-off between reduc- 
ing the computational burden and increasing communication 
overhead. However, recent benchmark results indicated the 
average time consumed in the Message Passing Interface 
(MPI) communication is typically 5% of the total simulation 
time. Moreover, from the NEMO 3-D cases, it was shown that 
the total simulation time was not bound by communication as 
long as the ratio of the number of surface atoms to the total 
number of atoms in each subdomain is kept sufficiently small 

m. 
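
The surface-to-total ratio mentioned above can be estimated with a short sketch. It rests on simplifying assumptions that are not taken from the paper: unit cells stand in for atoms, only the outermost one-cell-thick layer along each decomposed axis is counted as surface, and the subdomain is an interior one with neighbors on both sides.

```python
def boundary_fraction(shape, split_axes):
    """Fraction of unit cells in a subdomain that border a neighboring subdomain.

    shape      -- (nx, ny, nz) unit cells in the subdomain
    split_axes -- (bool, bool, bool); True where the device is decomposed
                  along that axis

    Assumes an interior subdomain and a one-cell-thick boundary layer.
    """
    total = 1
    interior = 1
    for n, split in zip(shape, split_axes):
        total *= n
        # Along a decomposed axis the two outermost layers face neighbors.
        interior *= max(n - 2, 0) if split else n
    return (total - interior) / total


if __name__ == "__main__":
    # An 80 x 80 x 80 unit-cell cube, decomposed in different ways.
    print(boundary_fraction((10, 80, 80), (True, False, False)))  # 1-D, 8 slabs
    print(boundary_fraction((1, 80, 80), (True, False, False)))   # 1-D, 80 slabs
    print(boundary_fraction((20, 20, 20), (True, True, True)))    # 3-D, 4 x 4 x 4
```

Under these assumptions, a 1-D decomposition pushed to one unit cell per slab leaves every cell on a boundary, whereas a 3-D decomposition into 20 x 20 x 20 blocks keeps most cells in the interior.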

B. Benchmark Results 

The strong scaling plot of 500 Lanczos iterations
using the basic 1-D parallelism for elongated systems with
different numbers of atoms is presented in Fig. 2. This plot
indicates that, with a minimal communication load, OMEN 3-D
shows reasonable scalability up to a structure that contains
32 million atoms with 512 processors on Ranger. However,
with a smaller number of atoms per subdomain, fluctuations
in performance are observed as the number of processors
increases; this instability arises because the communication load
becomes comparable to the computational load.

Fig. 2. Strong scaling benchmark results of the 1-D decomposition scheme
in OMEN 3-D. 500 Lanczos iterations are measured on elongated silicon
structures (subfigure). The number of atoms ranges from 1,000 ('1k' in the
figure) to 64 million ('64M').

Fig. 3. Strong scaling comparison between 1-D, 2-D, and 3-D spatial
decomposition. Performance of 500 Lanczos iterations is measured on a
44 x 44 x 44 nm^3 silicon cube (4 million atoms). For the 2-D case, the
processors are assigned as (Cx, Cy, Cz) = (16, 2^i, 1), i = 0, ..., 4. For
the 3-D case, (Cx, Cy, Cz) = (2^i, 2^j, 2^k), i, j, k = 1, 2, 3.

The strong scaling plot in Fig. 3 examines the performance
of the 3-D decomposition scheme using 500 Lanczos
iterations. The structure under test is a 44 x 44 x 44 nm^3
silicon cube, which has 4 million atoms (80 unit cells in each
direction). As in the previous strong scaling result, the 1-D
decomposition scheme scales linearly, but the number of processors
that can be assigned is limited to 80. In contrast, 2-D and
3-D parallelization allow more processors to be assigned to the
calculation, resulting in a proportional time reduction. In
this example, the performance is enhanced by a factor of
13.3 when 16 times more processors are used. Therefore, by
utilizing more computational resources, the 3-D decomposition
scheme opens the possibility of delivering simulation results
for realistic devices within a significantly reduced time.

Fig. 4. Comparison of strain performance between 1-D and 3-D
decomposition. A cylindrical InAs QD of size 20 nm (D) x 5 nm (H) is
encapsulated in a 68 x 68 x 68 nm^3 GaAs buffer. This structure has
13 million atoms.
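
The numbers quoted for the silicon cube can be checked with a few lines of arithmetic. The sketch below is illustrative only; the helper names are hypothetical, and the count of 8 atoms per conventional cubic unit cell is the standard value for the diamond lattice rather than a figure stated in this paper.

```python
ATOMS_PER_CUBIC_CELL = 8  # diamond-lattice conventional cell (standard value)


def atoms_in_cube(cells_per_side):
    """Number of atoms in a cube of cells_per_side^3 conventional unit cells."""
    return ATOMS_PER_CUBIC_CELL * cells_per_side ** 3


def parallel_efficiency(speedup, processor_ratio):
    """Measured speed-up divided by the ideal (linear) speed-up."""
    return speedup / processor_ratio


if __name__ == "__main__":
    # 80 unit cells per side, as in the cube benchmark.
    print(atoms_in_cube(80))
    # 13.3x faster on 16x more processors.
    print(parallel_efficiency(13.3, 16))
```

With 80 unit cells per side the cube holds 4,096,000 atoms, consistent with the quoted 4 million, and a 13.3-fold speed-up on 16 times more processors corresponds to a parallel efficiency of roughly 83%.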

Typical NEMO 3-D simulations of InAs/GaAs quantum
dot (QD) systems [11] not only involve electronic structure
calculations but also require minimization of the total strain
energy with an atomistic Valence Force Field (VFF) method
[12]. This strain calculation is computationally significantly
simpler than the subsequent electronic structure calculation.
It therefore does not, in general, scale as well with increased
parallelism. Here we test the VFF algorithm with 1-D and 3-D
decomposition in OMEN 3-D (Fig. 4). The sample structure is
a cylindrical InAs QD of size 20 nm (D) x 5 nm (H) embedded
in a 68 x 68 x 68 nm^3 GaAs buffer, which comprises
13 million atoms. Again, the 3-D decomposition scheme
reduces the run time further, by a factor of 3.5 when 4 times
more processors are allocated.

C. Multi-Level Parallelism 

OMEN 3-D also has a programmable interface ready for
multi-level parallelism, as depicted in Fig. 1(b), to achieve extra
performance enhancement. In contrast to the spatial domain
decomposition, where the processors are coupled to each other
by MPI communication, this hierarchical parallelism solves
tasks independently, with different parameters assigned to each
group. K-space grouping, for example, can be useful when
bandstructure or charge calculations are needed. Additional
bias groups can be added for simulations that involve
external electric or magnetic fields. Depending on the application,
OMEN 3-D can provide multiple levels of additional
parallelism to utilize more computational resources.
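
One way to picture this hierarchy is as a mapping from a global processor rank to a (bias group, k-point group, spatial rank) triple. The sketch below is an assumption-laden illustration, not the OMEN 3-D interface: the function name, the ordering of the levels, and the requirement that all groups have equal size are choices made here. In an MPI code, the combined group index could serve as the color argument of MPI_Comm_split, so that each group obtains its own communicator for the spatial decomposition.

```python
def group_layout(rank, world_size, n_bias, n_k):
    """Return (bias_group, k_group, spatial_rank) for a global rank.

    Processors are first divided into n_bias bias groups, each bias group
    into n_k k-point groups, and the processors left in each k-point group
    share one spatially decomposed device. Equal group sizes are assumed.
    """
    n_groups = n_bias * n_k
    if world_size % n_groups != 0:
        raise ValueError("world_size must be a multiple of n_bias * n_k")
    if not 0 <= rank < world_size:
        raise ValueError("rank outside communicator")
    spatial_size = world_size // n_groups
    group, spatial_rank = divmod(rank, spatial_size)
    bias_group, k_group = divmod(group, n_k)
    return bias_group, k_group, spatial_rank


if __name__ == "__main__":
    # 64 processors, 2 bias points, 4 k-points: 8 processors per device copy.
    for rank in (0, 8, 37, 63):
        print(rank, group_layout(rank, 64, n_bias=2, n_k=4))
```

Only processors that share the same bias and k-point group need to exchange boundary data; different groups work on independent parameter sets.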

III. Application 

A. The Schrodinger-Poisson Solver 

One of the first applications of OMEN 3-D is the self-consistent
charge and potential calculation module, known
as the Schrödinger-Poisson solver, which was not present in
NEMO 3-D. There are three main components in the
self-consistent loop:

1) Schrödinger Equation Solver: Solves for the eigenstates of
the Schrödinger equation on a finite set of k-points, based on
either the sp^3d^5s* tight binding or the effective mass
Hamiltonian, using an iterative eigenvalue solver such as (block)
Lanczos or PARPACK.

2) Charge Calculation: Based on the eigensolutions of
the Schrödinger equation, there are two different
approaches to obtaining the charge profile. If the Fermi level
is given, the charge profile can be calculated
simply by filling up the states. If, on the other hand,
the Fermi level needs to be determined by external
conditions, such as charge neutrality, both the charge
and the Fermi level can be determined simultaneously.

3) Poisson Solver: The charge is fed into the Poisson solver.
The Poisson solver in OMEN 3-D also adopts 3-D
parallelism and uses a finite difference method with the
Aztec linear solver. The converged potential profile is
obtained iteratively using the Newton-Raphson method. The
resulting potential is updated in the Hamiltonian and the
steps are repeated until self-consistency is achieved.
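
The second branch of the charge calculation step, in which the Fermi level is fixed by an external condition, can be sketched as a one-dimensional root search. The example below is a simplified illustration and not the OMEN 3-D implementation: it treats the eigenvalues as a flat list of discrete levels with a fixed spin degeneracy, ignores k-point weights, and uses plain bisection; all function names are hypothetical.

```python
import math

K_B_EV = 8.617333262e-5  # Boltzmann constant in eV/K


def fermi(energy, mu, temperature):
    """Fermi-Dirac occupation, guarded against overflow in exp()."""
    x = (energy - mu) / (K_B_EV * temperature)
    if x > 700.0:
        return 0.0
    if x < -700.0:
        return 1.0
    return 1.0 / (1.0 + math.exp(x))


def electron_count(energies, mu, temperature, degeneracy=2):
    """Total number of electrons held by the levels at Fermi level mu."""
    return degeneracy * sum(fermi(e, mu, temperature) for e in energies)


def find_fermi_level(energies, n_electrons, temperature, degeneracy=2, tol=1e-9):
    """Bisect for the Fermi level (eV) at which the levels hold n_electrons.

    The electron count is non-decreasing in mu, so bisection is safe. If the
    target falls in a gap between levels at low temperature, many values of
    mu satisfy it to machine precision and the result is only one of them.
    """
    if not 0 < n_electrons < degeneracy * len(energies):
        raise ValueError("n_electrons must lie strictly between 0 and full filling")
    lo = min(energies) - 1.0
    hi = max(energies) + 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if electron_count(energies, mid, temperature, degeneracy) < n_electrons:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    levels = [0.0, 0.1]  # hypothetical eigenvalues in eV
    print(find_fermi_level(levels, n_electrons=1, temperature=4.0))
    print(find_fermi_level(levels, n_electrons=2, temperature=300.0))
```

In a full self-consistent loop the resulting occupations would be turned into a charge density and passed to the Poisson step, and the search would be repeated after every potential update.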

The Schrödinger-Poisson solver has been applied to two
physical simulations:

• Investigation of the Charge Distribution of a Realistically
Sized FinFET using the Top of the Barrier Model [14],
[15]: The non-uniform current distribution in tri-gated
devices of cross-section 65 nm (H) x 25 nm (W) versus
gate voltage was successfully demonstrated.

• Bandstructure Calculation of 1-D/2-D Highly P-Doped
Silicon Structures [16], [17]: The self-consistent bandstructure
of closely positioned impurity atoms can be obtained. The
self-consistent scheme applied to the impurity system is
briefly introduced in the next section.

B. Example: Bandstructure of a Phosphorus δ Layer in
Silicon

Advances in fabrication processes have enabled the creation
of atomic-scale devices in silicon. Using a Scanning Tunneling
Microscope (STM), experimentalists can fabricate phosphorus
δ layers and encapsulate them in silicon [18]. This technology
is significant in two ways: it is relevant to nanoelectronic device
fabrication, and it highlights the possibility of fabricating
quantum computers. Using the self-consistent method, the
bandstructure of the 2-D periodic δ layer structure with a
doping density of 1/4 monolayer (ML) (Fig. 5) is calculated
at T = 4 K. The bandstructure (Fig. 6) indicates that, due to the
potential induced by closely placed ionized donors, impurity
bands are formed below the silicon bulk conduction band.
The band minimum and the Fermi level are located 327.6 meV
and 44.4 meV below the conduction band minima, respectively.
For a detailed simulation and analysis of the temperature
dependence of the Si:P δ layer, refer to [19].

Fig. 5. The example structure of a Si:P δ layer (red) embedded in a 20 nm
silicon buffer. It is periodic in 2-D with a planar doping of 1/4 ML
(2.0 x 10^14 cm^-2).

Fig. 6. The bandstructure result with respect to the silicon bulk conduction
band minima after self-consistency is achieved.

IV. Conclusion 

The new nanoelectronic simulator OMEN 3-D has been developed
to overcome the limitations of NEMO 3-D in a processor-rich
environment. The new parallel algorithm introduced in
OMEN 3-D shows better scalability and is applicable to
massive simulations, and we expect to run further tests on
several thousand processors. This work will allow us to
perform NEMO 3-D-like calculations in minutes rather than
days. As an example, the Schrödinger-Poisson module and its
application to a Si:P δ layer at 4 K were introduced. Due to the
potential formed by the impurity ions, a set of impurity bands is
observed below the Si bulk conduction band. According to the
simulation result, the Fermi level is located 44.4 meV below
the conduction band minima for the 1/4 ML Si:P layer.